{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Agenda\n",
      "\n",
      "- Define the problem and the approach\n",
      "- Data basics: loading data, looking at your data, basic commands\n",
      "- <p style=\"color: red\">Handling missing values</p>\n",
      "- Intro to scikit-learn\n",
      "- Grouping and aggregating data\n",
      "- Feature selection\n",
      "- Fitting and evaluating a model\n",
      "- Deploying your work"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#In this workbook you will\n",
      "- Learn about imputation and what it's used for\n",
      "- Use the KNearestNeighbors algorithm to impute data\n",
      "- Come up with your own imputation strategy"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Imputation\n",
      "Imputation is fairly intuitive. For the missing data in our dataset, we're going to replace it with values that come from similar records in our dataset that aren't null.\n",
      "To do this we're going to use the NearestNeighbors algorithm."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np\n",
      "import pylab as pl"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.read_csv(\"./data/credit-data-post-import.csv\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Cross Validation\n",
      "We're going to use the simplest type of cross validation. we'll simply split our data into 2 groups: training and test. we'll use the training set to calibrate our model and then use the test set to  evaluate how effective it is."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "is_test = np.random.uniform(0, 1, len(df)) > 0.75\n",
      "train = df[is_test==False]\n",
      "test = df[is_test==True]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(train), len(test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "(112919, 37081)"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Be sure to calibrate the imputation with the training set"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.neighbors import KNeighborsRegressor\n",
      "\n",
      "income_imputer = KNeighborsRegressor(n_neighbors=1)\n",
      "\n",
      "#split our data into 2 groups; data containing nulls and data \n",
      "# not containing nulls we'll train on the latter and make\n",
      "# 'predictions' on the null data to impute monthly_income\n",
      "train_w_monthly_income = train[train.monthly_income.isnull()==False]\n",
      "train_w_null_monthly_income = train[train.monthly_income.isnull()==True]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cols = ['debt_ratio', 'age']\n",
      "income_imputer.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 27,
       "text": [
        "KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',\n",
        "          n_neighbors=1, p=2, weights='uniform')"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "new_values = income_imputer.predict(train_w_null_monthly_income[cols])\n",
      "new_values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "array([ 0.,  0.,  1., ...,  1.,  0.,  0.])"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train_w_null_monthly_income['monthly_income'] = new_values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#combine the data back together\n",
      "train = train_w_monthly_income.append(train_w_null_monthly_income)\n",
      "len(train)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 33,
       "text": [
        "112919"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "test['monthly_income_imputed'] = income_imputer.predict(test[cols])\n",
      "test.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>serious_dlqin2yrs</th>\n",
        "      <th>revolving_utilization_of_unsecured_lines</th>\n",
        "      <th>age</th>\n",
        "      <th>number_of_time30-59_days_past_due_not_worse</th>\n",
        "      <th>debt_ratio</th>\n",
        "      <th>monthly_income</th>\n",
        "      <th>number_of_open_credit_lines_and_loans</th>\n",
        "      <th>number_of_times90_days_late</th>\n",
        "      <th>number_real_estate_loans_or_lines</th>\n",
        "      <th>number_of_time60-89_days_past_due_not_worse</th>\n",
        "      <th>number_of_dependents</th>\n",
        "      <th>monthly_income_imputed</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0 </th>\n",
        "      <td> 1</td>\n",
        "      <td> 0.766127</td>\n",
        "      <td> 45</td>\n",
        "      <td> 2</td>\n",
        "      <td>  0.802982</td>\n",
        "      <td>  9120</td>\n",
        "      <td> 13</td>\n",
        "      <td> 0</td>\n",
        "      <td> 6</td>\n",
        "      <td> 0</td>\n",
        "      <td> 2</td>\n",
        "      <td>  1846</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1 </th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.957151</td>\n",
        "      <td> 40</td>\n",
        "      <td> 0</td>\n",
        "      <td>  0.121876</td>\n",
        "      <td>  2600</td>\n",
        "      <td>  4</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> 18000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5 </th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.213179</td>\n",
        "      <td> 74</td>\n",
        "      <td> 0</td>\n",
        "      <td>  0.375607</td>\n",
        "      <td>  3500</td>\n",
        "      <td>  3</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td>  2100</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8 </th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.116951</td>\n",
        "      <td> 27</td>\n",
        "      <td> 0</td>\n",
        "      <td> 46.000000</td>\n",
        "      <td>   NaN</td>\n",
        "      <td>  2</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td>     0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>15</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.548458</td>\n",
        "      <td> 64</td>\n",
        "      <td> 0</td>\n",
        "      <td>  0.209892</td>\n",
        "      <td> 11362</td>\n",
        "      <td>  7</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 2</td>\n",
        "      <td>  3928</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 38,
       "text": [
        "    serious_dlqin2yrs  revolving_utilization_of_unsecured_lines  age  \\\n",
        "0                   1                                  0.766127   45   \n",
        "1                   0                                  0.957151   40   \n",
        "5                   0                                  0.213179   74   \n",
        "8                   0                                  0.116951   27   \n",
        "15                  0                                  0.548458   64   \n",
        "\n",
        "    number_of_time30-59_days_past_due_not_worse  debt_ratio  monthly_income  \\\n",
        "0                                             2    0.802982            9120   \n",
        "1                                             0    0.121876            2600   \n",
        "5                                             0    0.375607            3500   \n",
        "8                                             0   46.000000             NaN   \n",
        "15                                            0    0.209892           11362   \n",
        "\n",
        "    number_of_open_credit_lines_and_loans  number_of_times90_days_late  \\\n",
        "0                                      13                            0   \n",
        "1                                       4                            0   \n",
        "5                                       3                            0   \n",
        "8                                       2                            0   \n",
        "15                                      7                            0   \n",
        "\n",
        "    number_real_estate_loans_or_lines  \\\n",
        "0                                   6   \n",
        "1                                   0   \n",
        "5                                   1   \n",
        "8                                   0   \n",
        "15                                  1   \n",
        "\n",
        "    number_of_time60-89_days_past_due_not_worse  number_of_dependents  \\\n",
        "0                                             0                     2   \n",
        "1                                             0                     1   \n",
        "5                                             0                     1   \n",
        "8                                             0                     0   \n",
        "15                                            0                     2   \n",
        "\n",
        "    monthly_income_imputed  \n",
        "0                     1846  \n",
        "1                    18000  \n",
        "5                     2100  \n",
        "8                        0  \n",
        "15                    3928  "
       ]
      }
     ],
     "prompt_number": 38
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "test['monthly_income'] = np.where(test.monthly_income.isnull(), test.monthly_income_imputed,\n",
      "                                  test.monthly_income)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print pd.value_counts(train.monthly_income.isnull())\n",
      "print pd.value_counts(test.monthly_income.isnull())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "False    112919\n",
        "False    37081\n"
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train.to_csv(\"./data/credit-data-trainingset.csv\", index=False)\n",
      "test.to_csv(\"./data/credit-data-testset.csv\", index=False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 48
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Trying your own imputation strategy\n",
      "Partner with the person next to you and think of other ways you might be able to impute missing valus for the `monthly_income` column.\n",
      "\n",
      "Things to consider:\n",
      "\n",
      "- What are some other methods you could use for replacing nulls?\n",
      "- What about handling outlying values? Should a montly income of $3MM be treated as missing?\n",
      "- Check out the [scikit-learn docs](http://scikit-learn.org/stable/auto_examples/imputation.html) for more examples\n",
      "- How might you evaluate the effectiveness of your imputation?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##We just did the following\n",
      "\n",
      "- Split our data into a training set for building our model and a test set for evaluating its performance\n",
      "- Used KNearestNeighbors to fill in missing values for `monthly_income`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}